<fix>[vm]: Fix lock rollback on migration failure #3977
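Walkthrough
The migration-failure handling path now queries the VM state on the destination host. Depending on the result, it either completes the migration on the destination (updating the DB and firing the extension points) or rolls back on the source. New integration tests cover both paths: the destination host returning Running and returning a non-Running state.
Changes
Refactored the VM migration failure handling flow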
Sequence Diagram(s)
sequenceDiagram
participant Client
participant VmInstanceBase
participant DestHost
participant Database
participant ExtEmitter
Client->>VmInstanceBase: Issue MigrateVmAction (migration)
VmInstanceBase->>DestHost: Send migrate request (failure callback)
VmInstanceBase->>DestHost: CheckVmStateOnHypervisorMsg(getVmStateOnHost)
DestHost-->>VmInstanceBase: Reply with VM state ("Running" / "Stopped" / "Paused")
alt State is Running
VmInstanceBase->>DestHost: checkState
VmInstanceBase->>Database: Update zone/cluster/lastHostUuid/hostUuid and refresh VM
VmInstanceBase->>ExtEmitter: postMigrateVm
ExtEmitter-->>VmInstanceBase: Return
VmInstanceBase->>ExtEmitter: afterMigrateVm
ExtEmitter-->>VmInstanceBase: Return
VmInstanceBase-->>Client: completion.success()
else State is not Running, or the query failed
VmInstanceBase->>ExtEmitter: failedToMigrateVm
ExtEmitter-->>VmInstanceBase: Return
VmInstanceBase->>DestHost: (For specific error codes) checkState again on the source host
VmInstanceBase-->>Client: completion.fail(err)
end
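The two branches of the diagram can be sketched as a small, synchronous model. This is only an illustration under stated assumptions: the class `MigrateFailureFlowSketch`, its `Outcome` enum, and the step names are hypothetical and are not ZStack code; the real flow is asynchronous and message-driven.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified model of the failure-handling branch in the diagram.
// None of these types are ZStack classes; they only mirror the message order.
final class MigrateFailureFlowSketch {
    enum Outcome { COMPLETED_ON_DESTINATION, ROLLED_BACK_ON_SOURCE }

    // Records the steps taken so the ordering can be inspected.
    final List<String> steps = new ArrayList<>();

    // destState is empty when the state query itself failed.
    Outcome handleFailedMigrate(Optional<String> destState) {
        steps.add("checkVmStateOnDestination");
        if (destState.isPresent() && "Running".equals(destState.get())) {
            // The VM landed on the destination: finish like a normal success.
            steps.add("checkState");
            steps.add("updateDatabase");
            steps.add("postMigrateVm");
            steps.add("afterMigrateVm");
            return Outcome.COMPLETED_ON_DESTINATION;
        }
        // Any other state, or a failed query, keeps the rollback behavior.
        steps.add("failedToMigrateVm");
        return Outcome.ROLLED_BACK_ON_SOURCE;
    }

    public static void main(String[] args) {
        MigrateFailureFlowSketch flow = new MigrateFailureFlowSketch();
        System.out.println(flow.handleFailedMigrate(Optional.of("Running")) + " " + flow.steps);
    }
}
```

Treating a failed state query the same as a non-Running reply matches the `else` branch of the diagram, where both cases end in `completion.fail(err)`.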
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java`:
- Around line 7248-7257: The recovery-success branch skips the
VmMigratePostCallExtensionFlow.postMigrateVm() extension and only calls
completeMigrateVmOnDestination()/extEmitter.afterMigrateVm(), causing divergence
from the normal success path; update the recovery-success path to invoke the
same full post-migrate sequence as the normal success flow (i.e. run
VmMigratePostCallExtensionFlow.postMigrateVm() then
extEmitter.afterMigrateVm()), or refactor the success cleanup into a shared
helper and call that from both places (ensure postMigrateVm is executed before
afterMigrateVm); modify the code paths around completeMigrateVmOnDestination,
postMigrateVm, and extEmitter.afterMigrateVm to reuse the unified cleanup flow.
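The "shared helper" option from the comment above could look roughly like the following sketch. Everything here is an assumption made for illustration: `MigrateHooks`, `finishMigration`, and the two-argument hook signatures are hypothetical stand-ins, not the signatures of `VmMigratePostCallExtensionFlow` or `extEmitter`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one helper owns the success cleanup order, and both the
// normal success path and the recovery-success path call it.
final class SharedMigrateCleanupSketch {
    // Stand-in for the extension emitter; the real signatures may differ.
    interface MigrateHooks {
        void postMigrateVm(String vmUuid, String destHostUuid);
        void afterMigrateVm(String vmUuid, String srcHostUuid);
    }

    // Test double that records the order in which hooks fire.
    static final class RecordingHooks implements MigrateHooks {
        final List<String> calls = new ArrayList<>();

        @Override
        public void postMigrateVm(String vmUuid, String destHostUuid) {
            calls.add("postMigrateVm");
        }

        @Override
        public void afterMigrateVm(String vmUuid, String srcHostUuid) {
            calls.add("afterMigrateVm");
        }
    }

    private final MigrateHooks hooks;

    SharedMigrateCleanupSketch(MigrateHooks hooks) {
        this.hooks = hooks;
    }

    // The only place that defines the order: postMigrateVm, then afterMigrateVm.
    private void finishMigration(String vmUuid, String srcHostUuid, String destHostUuid) {
        hooks.postMigrateVm(vmUuid, destHostUuid);
        hooks.afterMigrateVm(vmUuid, srcHostUuid);
    }

    void onNormalSuccess(String vmUuid, String srcHostUuid, String destHostUuid) {
        finishMigration(vmUuid, srcHostUuid, destHostUuid);
    }

    void onRecoveredAfterFailure(String vmUuid, String srcHostUuid, String destHostUuid) {
        finishMigration(vmUuid, srcHostUuid, destHostUuid);
    }

    public static void main(String[] args) {
        RecordingHooks hooks = new RecordingHooks();
        new SharedMigrateCleanupSketch(hooks).onRecoveredAfterFailure("vm-1", "src-host", "dst-host");
        System.out.println(hooks.calls);
    }
}
```

Routing both paths through one private method means the two success paths cannot drift apart again without someone editing that method.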
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
Run ID: ccf6c20e-60bc-4277-945b-ad11b7e778a1
📒 Files selected for processing (2)
compute/src/main/java/org/zstack/compute/vm/VmGlobalConfig.java
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
Comment on Comment from shan.wu:
Addressed: recovery-success cleanup now calls |
8d3ba77 to d3f3fe1
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java`:
- Around line 7209-7238: handleFailedMigrateVm only checks the destination host
once (via getVmStateOnHost) and immediately calls completeMigrateVmOnDestination
if the VM is running on destination, which can prematurely treat migration as
complete while the source is still alive; update handleFailedMigrateVm to (a)
when destination reports the VM running, also verify the source host state (use
getVmStateOnHost for lastHostUuid and isVmRunningOnHost) and only call
completeMigrateVmOnDestination if the source is confirmed not running there, or
(b) implement a retry loop with interval and timeout that rechecks destination
and source states before deciding, and fall back to rollbackFailedMigrateVm if
checks/timeouts fail; touch the methods handleFailedMigrateVm, the
ReturnValueCompletion callbacks, and reuse
rollbackFailedMigrateVm/completeMigrateVmOnDestination to enforce the correct
gating.
- Around line 205-206: The helper isVmRunningOnHost(String state) currently
treats only VmInstanceState.Running as a successful post-migration state; update
it to consider other valid landed states (at minimum VmInstanceState.Paused in
addition to Running) so that a VM which was Paused before migration and remains
Paused on the target is treated as a successful recovery rather than triggering
rollback; modify the method (and the analogous checks referenced around lines
7218-7223) to return true for VmInstanceState.Running.toString().equals(state)
|| VmInstanceState.Paused.toString().equals(state) (or an equivalent check
against an allowed-success set) and keep the method name isVmRunningOnHost
unchanged.
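Option (b) from the first comment, rechecking both hosts with an interval and a bounded number of attempts, might be sketched as below. This is a blocking illustration only, under several assumptions: `MigrateStateGateSketch`, `Decision`, and `decide` are hypothetical names, the state suppliers stand in for `getVmStateOnHost`, and the real code would use asynchronous `ReturnValueCompletion` callbacks rather than `Thread.sleep`. The choice to roll back immediately when only the source reports Running is also an assumption of this sketch.

```java
import java.util.function.Supplier;

// Hypothetical gating sketch: complete on the destination only when the
// destination reports Running AND the source no longer does.
final class MigrateStateGateSketch {
    enum Decision { COMPLETE_ON_DESTINATION, ROLLBACK }

    static Decision decide(Supplier<String> destState, Supplier<String> srcState,
                           int maxAttempts, long intervalMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            boolean onDest = "Running".equals(destState.get());
            boolean onSrc = "Running".equals(srcState.get());
            if (onDest && !onSrc) {
                return Decision.COMPLETE_ON_DESTINATION;
            }
            if (!onDest && onSrc) {
                return Decision.ROLLBACK;
            }
            // Both alive (still converging) or neither alive: wait and recheck.
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return Decision.ROLLBACK;
                }
            }
        }
        // Attempts exhausted without a clear answer: fall back to rollback.
        return Decision.ROLLBACK;
    }

    public static void main(String[] args) {
        System.out.println(decide(() -> "Running", () -> "Stopped", 3, 10L));
    }
}
```

Falling back to rollback on exhaustion follows the review comment's wording ("fall back to rollbackFailedMigrateVm if checks/timeouts fail").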
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
Run ID: 3d10bac5-21f9-4fc4-92d1-88cb057aaf17
📒 Files selected for processing (1)
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
private boolean isVmRunningOnHost(String state) {
    return VmInstanceState.Running.toString().equals(state);
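Recovery-success detection should not accept only Running.
The normal success path already preserves the semantics of a VM that was Paused before migration, but here "migration completed" is hard-coded to require Running on the destination. As a result, when a paused VM has successfully landed on the destination and the host returns Paused, the recovery branch still wrongly takes the rollback path, which is inconsistent with the normal migration success path.
💡 Suggested change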
-    private boolean isVmRunningOnHost(String state) {
-        return VmInstanceState.Running.toString().equals(state);
+    private boolean isVmCompletedOnHost(String state, VmInstanceState originState) {
+        if (VmInstanceState.Running.toString().equals(state)) {
+            return true;
+        }
+
+        return originState == VmInstanceState.Paused
+                && VmInstanceState.Paused.toString().equals(state);
     }

-            if (!isVmRunningOnHost(state)) {
+            if (!isVmCompletedOnHost(state, originState)) {
                 rollbackFailedMigrateVm(originState, destHostUuid, errCode, completion);
                 return;
             }

Also applies to: 7218-7223
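The suggested diff and the AI-agent prompt describe two slightly different checks, and a self-contained sketch makes the difference visible. The `VmState` enum below is a local stand-in for the project's `VmInstanceState` (only the three values seen in this thread), and `isLandedState` is a hypothetical name for the "allowed-success set" variant.

```java
// Hypothetical comparison of the two proposed landed-state checks.
final class LandedStateCheckSketch {
    // Local stand-in for VmInstanceState; not the project's enum.
    enum VmState { Running, Paused, Stopped }

    // Variant from the suggested diff: Paused counts only when the VM was
    // already Paused before the migration started.
    static boolean isVmCompletedOnHost(String state, VmState originState) {
        if (VmState.Running.toString().equals(state)) {
            return true;
        }
        return originState == VmState.Paused
                && VmState.Paused.toString().equals(state);
    }

    // Variant from the prompt text: Running or Paused counts, whatever the
    // origin state was.
    static boolean isLandedState(String state) {
        return VmState.Running.toString().equals(state)
                || VmState.Paused.toString().equals(state);
    }

    public static void main(String[] args) {
        // A VM that was Running before migration but is reported Paused on
        // the destination is where the two variants disagree.
        System.out.println(isVmCompletedOnHost("Paused", VmState.Running));
        System.out.println(isLandedState("Paused"));
    }
}
```

The origin-aware variant is stricter: a VM that was Running before migration and shows up as Paused on the destination still triggers rollback, whereas the set-based variant would treat it as landed.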
aa8f1e4 to 8a602fb
Check the destination host when the hypervisor migration call fails. If the VM is Running there, continue the normal migration flow. That lets DB sync and post hooks run through the standard path. Otherwise fail the flow and keep the rollback behavior.

Resolves: ZSTAC-83894
Change-Id: I8b4774a405fc3b1c05d21b6742facd26bc8d03e6
8a602fb to 8e7ecd6
Root Cause
Migration API failure was always handled as a normal migration failure rollback. In practice, libvirt migration may have already completed or may still be converging while the API call reports failure. In that window, both source and destination hosts can report the VM as alive, and rolling back immediately can leave LV lock ownership inconsistent with the actual VM runtime side.
Solution
Test
`git diff --check` passed. `mvn -pl compute -am -DskipTests compile` was attempted but stopped at an existing class-resolution issue in the `network` module before reaching compute.
sync from gitlab !9869